HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes.

Authors

  • M E Hawley
  • K K Kidd
Abstract

A DNA haplotype system combines information from two or more distinct polymorphic systems located within a small length of DNA in which there is little recombination. Each possible combination of alleles, one from each component system, constitutes a distinct haplotype that is treated as an allele of the haplotype system. Combining the information from several polymorphic systems (such as RFLPs) into such a haplotype system defines alleles that arise from recombination among as well as mutation at the individual systems. Haplotype systems generally have been found to show a great deal of variability within and among populations, and thus can be more informative for genetic comparisons between populations and for linkage mapping purposes than the component systems treated independently. Even when all component systems are codominant, a characteristic of haplotype systems is ambiguity, by which we mean that some phenotypes may correspond to several genotypes. When an individual is heterozygous for no more than one of the component systems, the genotype is uniquely specified; there is no ambiguity. However, even the simplest haplotype system of two loci, each with two alleles, can present an ambiguous phenotype: an individual who is heterozygous at both sites (typing Aa and Bb) may have either genotype AB/ab or Ab/aB. Sometimes these ambiguities can be resolved when data are available for related individuals; they are usually not directly resolvable for an "isolated" individual, but see Stephens et al. (1990) and Ruano et al. (1990). Another characteristic of real datasets is missing data on one (or more) of the component loci. Because it is now common for there to be five or more polymorphic loci close enough together to be haplotyped, it is often very time-consuming and expensive to go back to complete typing for every system on all individuals. 
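To make the ambiguity concrete, the following Python sketch enumerates the genotypes (unordered pairs of haplotypes) that are compatible with a multi-locus phenotype. The function name `compatible_genotypes` and the tuple-based encoding of a phenotype are illustrative assumptions and are not part of HAPLO; the sketch also assumes complete typing at every locus and single-character allele names.

```python
from itertools import product


def compatible_genotypes(phenotype):
    """Return all unordered haplotype pairs consistent with a phenotype.

    `phenotype` is a list with one entry per locus; each entry is the
    pair of alleles observed at that locus, e.g. ("A", "a").
    (Hypothetical encoding chosen for this illustration.)
    """
    # At each locus either allele may lie on the first haplotype.
    orientations = [[(x, y), (y, x)] for x, y in phenotype]
    genotypes = set()
    for choice in product(*orientations):
        hap1 = "".join(pair[0] for pair in choice)
        hap2 = "".join(pair[1] for pair in choice)
        # Sort so that AB/ab and ab/AB count as the same genotype.
        genotypes.add(tuple(sorted((hap1, hap2))))
    return sorted(genotypes)


# Heterozygous at one locus only: the genotype is uniquely specified.
single_het = compatible_genotypes([("A", "A"), ("B", "b")])

# Heterozygous at both loci: two genotypes, AB/ab and Ab/aB.
double_het = compatible_genotypes([("A", "a"), ("B", "b")])
```

With k heterozygous loci (k at least 1) this enumeration yields 2^(k-1) candidate genotypes, which is why the proportion of ambiguous individuals grows quickly as loci are added.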
As the number of polymorphic sites (loci) in a haplotype increases and the number of alleles at one or more sites increases above two, the proportion of individuals with ambiguity can increase dramatically; so can the difficulty in having complete typing data on all individuals. Existing computer programs have limitations. Those of Weir (1990) are limited to analysis of two loci with two codominant alleles each and to data sets with complete typing information, and do not allow any known phase information on individuals to be incorporated. Clark (1990) describes an algorithm for determining a minimally sufficient set of haplotypes to explain an observed set of phenotypes; it does not estimate frequencies. Long et al. (1995) describe a program that uses the EM algorithm of Dempster et al. (1977) to simultaneously estimate allele frequencies and the necessary lower and higher order disequilibrium coefficients to determine haplotype frequencies. Though the program can handle one recessive allele at each site (the rest being codominant), it cannot use known phase information or include incomplete data. We have written a FORTRAN program, HAPLO, that implements the EM algorithm to estimate haplotype frequencies from phenotype data on samples of unrelated individuals. The EM algorithm is a generalized iterative maximum likelihood approach to estimation that is useful when data are ambiguous and/or incomplete. Our implementation is for autosomal loci in Hardy-Weinberg proportions; we are working to extend it to X-linked systems.
In addition to our desire that it should analyze large haplotype systems, HAPLO was specifically designed to deal with the two limitations in the existing programs that were most relevant to our studies: (1) incomplete data on some individuals due to failure of typing for one (or more) of the component loci, and (2) the availability of data on relatives that allows complete or partial resolution of the genotype for some individuals with otherwise ambiguous phenotypes. Both situations were common for much of the DNA marker data being collected in our lab. We have used this program to estimate frequencies of haplotypes at several different loci in different human populations (Kidd et al. 1993; Lu et al., in press). With the growing use of DNA technology, studies of natural populations of many species can involve haplotype systems, making this program more broadly useful. We also note that, although motivated by haplotype studies, HAPLO treats each haplotype as an allele in a multi-allelic system and does not use the underlying nature of the data to obtain frequency estimates. In this sense, it is no different from frequency estimation for any multi-allelic system except that it allows the more complicated genotype-phenotype correspondences resulting from ambiguity and missing data. The EM algorithm is an iterative process; each iteration gives a set of frequency estimates that converge to stable maximum likelihood estimates. The iterations start with all haplotypes (alleles) at equal frequency. It is easy to show that the frequency estimates for haplotypes that are not definitely "observed," i.e., required to explain a phenotype, will go to zero. This is what is expected for a maximum likelihood estimate, as discussed by Clark (1990) in his description of an algorithm for determining the minimum set of haplotypes required to explain a sample of phenotypes. 
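The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a transcription of the FORTRAN program: the function name `em_haplotype_frequencies`, the convergence tolerance, and the representation of each individual as a list of candidate genotypes are all hypothetical choices made here. In this encoding, an individual whose phase is known (for example from relatives) has a single candidate genotype, while ambiguity or missing typing simply lengthens the candidate list. Genotype probabilities are taken in Hardy-Weinberg proportions, and all haplotypes start at equal frequency, as the text states.

```python
def em_haplotype_frequencies(individuals, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM under Hardy-Weinberg proportions.

    `individuals` is a list; each element is the list of candidate
    genotypes (hap1, hap2) compatible with that individual's phenotype.
    (Hypothetical data layout chosen for this illustration.)
    """
    haps = sorted({h for cands in individuals for g in cands for h in g})
    freq = {h: 1.0 / len(haps) for h in haps}  # equal starting frequencies
    n_chrom = 2 * len(individuals)

    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for cands in individuals:
            # E-step: Hardy-Weinberg probability of each candidate genotype
            # (p^2 for homozygotes, 2pq for heterozygotes).
            weights = [(1 if a == b else 2) * freq[a] * freq[b]
                       for a, b in cands]
            total = sum(weights)
            for (a, b), w in zip(cands, weights):
                # Each genotype contributes its posterior weight to both
                # of its haplotypes.
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by chromosomes sampled.
        new_freq = {h: counts[h] / n_chrom for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Toy data: four AB/AB, two ab/ab, and one ambiguous double heterozygote.
sample = ([[("AB", "AB")]] * 4
          + [[("ab", "ab")]] * 2
          + [[("AB", "ab"), ("Ab", "aB")]])
estimates = em_haplotype_frequencies(sample)
```

In this toy sample the haplotypes Ab and aB are never required to explain a phenotype, so their estimates shrink toward zero and the double heterozygote is effectively assigned to AB/ab, which matches the behaviour described in the text.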
This would appear to indicate that, if the set of phenotypes in the dataset is sufficiently clear to show from simple inspection that a possible haplotype is not required to explain the data, it can be omitted from the genetic model without altering the ultimate frequency estimates. This is usually true, but one interesting counterexample we observed in one dataset shows that "required" has a probabilistic aspect. We observed one individual with a multiply heterozygous phenotype that could be explained by several genotypes. Only two of those possibilities involved haplotypes otherwise definitely present in the sample. In these two possibilities, however, a different common haplotype was heterozygous with a different haplotype not otherwise seen. Neither of these "new" haplotypes was absolutely required, but one or the other had to be present. Their maximum likelihood frequency estimates were fractions of 1/2N, proportional to the frequency estimates of the two common haplotypes. Thus, although prior elimination of haplotypes can be valid, care must be used. The HAPLO program optionally estimates standard errors in two ways. First a jackknife procedure is used. Estimates of all haplotype frequencies are recalculated with each individual in turn removed from the data set. For each haplotype the standard deviation of those frequency esti-
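The jackknife step, in which the frequencies are re-estimated with each individual removed in turn, can be sketched as follows. The source text is cut off before it gives the exact formula, so the scaling used here, the conventional delete-one jackknife factor (n - 1)/n applied to the sum of squared deviations, is an assumption rather than a confirmed detail of HAPLO. The function names `em` and `jackknife_se` and the candidate-genotype data layout are likewise hypothetical.

```python
import math


def em(individuals, haps, n_iter=500, tol=1e-12):
    """Compact EM for haplotype frequencies under Hardy-Weinberg proportions."""
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for cands in individuals:
            weights = [(1 if a == b else 2) * freq[a] * freq[b]
                       for a, b in cands]
            total = sum(weights)
            for (a, b), w in zip(cands, weights):
                counts[a] += w / total
                counts[b] += w / total
        new_freq = {h: c / (2 * len(individuals)) for h, c in counts.items()}
        converged = max(abs(new_freq[h] - freq[h]) for h in haps) < tol
        freq = new_freq
        if converged:
            break
    return freq


def jackknife_se(individuals):
    """Delete-one jackknife standard errors for each haplotype frequency.

    Assumes the conventional jackknife scaling; the source does not
    show the formula HAPLO actually uses.
    """
    haps = sorted({h for cands in individuals for g in cands for h in g})
    n = len(individuals)
    # Re-estimate all frequencies with each individual left out in turn.
    replicates = [em(individuals[:i] + individuals[i + 1:], haps)
                  for i in range(n)]
    se = {}
    for h in haps:
        values = [rep[h] for rep in replicates]
        mean = sum(values) / n
        se[h] = math.sqrt((n - 1) / n * sum((v - mean) ** 2 for v in values))
    return se


# Toy data with no ambiguity: three A/A individuals and one a/a individual.
standard_errors = jackknife_se([[("A", "A")]] * 3 + [[("a", "a")]])
```

Because every replicate requires a full EM run, the cost grows with the sample size, which is one plausible reason for the standard error estimation being optional.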


Related articles

Application to Estimate Haplotypes for Multiallelic Present-Absent Loci

The article presents an algorithm and an application to estimate haplotype frequencies from genotype data for unrelated individuals. Presented approach can handle loci with multiple alleles as well as silent (null) alleles. The mathematical model and an expanded Expectation-Maximization algorithm is described. The computer program, called NullHap, available freely at http://staff.elka.pw.edu.pl...

HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination

MOTIVATION Haplotype reconstruction is an essential step in genetic linkage and association studies. Although many methods have been developed to estimate haplotype frequencies and reconstruct haplotypes for a sample of unrelated individuals, haplotype reconstruction in large pedigrees with a large number of genetic markers remains a challenging problem. METHODS We have developed an efficient...

Haplotype inference for present-absent genotype data using previously identified haplotypes and haplotype patterns

MOTIVATION Killer immunoglobulin-like receptor (KIR) genes vary considerably in their presence or absence on a specific regional haplotype. Because presence or absence of these genes is largely detected using locus-specific genotyping technology, the distinction between homozygosity and hemizygosity is often ambiguous. The performance of methods for haplotype inference (e.g. PL-EM, PHASE) for K...

Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms.

The mapping of SNPs in human genomes has generated a lot of interest from both the biomedical research community and industry. In conjunction with SNP mapping, researchers have shown that haplotypes possess considerably greater potential than the traditional single-SNP approach in disease-gene mapping and in our understanding of complex landscapes of linkage disequilib-rium (LD) (Goldstein 2001...

Estimating population haplotype frequencies from pooled SNP data using incomplete database information

MOTIVATION Information about haplotype structures gives a more detailed picture of genetic variation between individuals than single-locus analyses. Databases that contain the most frequent haplotypes of certain populations are developing rapidly (e.g. the HapMap database for single-nucleotide polymorphisms in humans). Utilization of such prior information about the prevailing haplotype structu...

Journal title:
  • The Journal of heredity

Volume 86, Issue 5

Pages  -

Publication date: 1995